perf: consider putting all rules into a single file to reduce IO #1212
Comments
Reopening to reconsider storing/loading the serialized RuleSet.
we've been discussing the strategy of caching rulesets to improve the startup time of capa. we currently think that capa spends around five seconds loading rules before analyzing a sample. most of this time is spent validating rules and statements; this is a CPU task, not an IO task. we note that, most of the time, capa rules don't change from invocation to invocation. therefore, we hypothesize that introducing a persistent cache of the validated ruleset can speed up capa by 4-5 seconds. the design might look something like this:
we'll want to be careful around race conditions when updating the cache. also, the pickle format can include code that's executed upon load, so we should be clear that the cache must be "trusted" somehow. if there's any issue processing the cache file, we should throw out the old one and regenerate it; worst case, the performance is about the same as today. there should be a way to disable the cache via CLI option, which is relevant for users running in docker or other transient environments. we expect the cache hit rate to be pretty high because we don't think that many capa users often modify their rules. for example, the standalone capa.exe will always use the embedded rules (unless overridden) so we could distribute a pre-built cache for capa.exe that benefits most users (we'll have to be very careful about versioning and probably cannot distribute pre-built caches for anything but capa.exe).
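The fallback behavior described above (use the cache when it loads, otherwise throw it out and regenerate) could be sketched roughly like this. This is only an illustration, not capa's implementation: `load_ruleset_cached`, `build`, and `use_cache` are hypothetical names, and `build` stands in for whatever currently parses and validates the rules.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_ruleset_cached(cache_path: Path, build: Callable[[], Any], use_cache: bool = True) -> Any:
    """Return the cached object if it loads cleanly, otherwise rebuild and re-cache.

    `build` is a hypothetical stand-in for the slow, validated rule loading.
    """
    if not use_cache:
        # e.g. the user disabled the cache via a CLI option
        return build()

    try:
        with cache_path.open("rb") as f:
            # pickle can execute code on load, so the file must be trusted
            return pickle.load(f)
    except FileNotFoundError:
        pass  # cold cache: fall through and build
    except Exception:
        # corrupt or incompatible cache: throw it out and regenerate
        cache_path.unlink(missing_ok=True)

    ruleset = build()
    try:
        cache_path.parent.mkdir(parents=True, exist_ok=True)
        # note: a real implementation would also want this write to be atomic
        cache_path.write_bytes(pickle.dumps(ruleset))
    except OSError:
        pass  # caching is best-effort; worst case we are as slow as today
    return ruleset
```

With this shape, any problem with the cache costs one normal rule load and nothing more.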
Pickle is the easiest option vs. creating a custom JSON export format (likely several hundred LOC) for all objects embedded in a RuleSet. We should be clear about the threat model for code execution via a Pickled cache file: if users can't/don't share the cache file (as it resides in a mostly hidden location that's far away from capa code and rules) then it is hard to imagine the social engineering required for an attacker to convince a user to load a malicious cache file. However, this makes clear the point that we should not distribute pre-built cache files, except within the standalone binaries (where the fact a cache is present is not important at all). In general, users should not be aware there is a cache in use at all. Next steps:
cache file location

The cache file should not be stored alongside capa code nor capa rules (it should be basically impossible to accidentally share). Users shouldn't stumble across the file unless they go looking for it. If the file is deleted accidentally or by the OS to reclaim disk/memory, that's totally ok. We should try to find existing work on the topic of cache files and follow conventions. Best choices so far:
filename:

When writing the cache, create it as a temp file (using an appropriate library) and then atomically move it to its destination (after ensuring the directory exists). We should try to avoid any race condition in cache file creation. We should enable users to 1) disable the cache via env var, and 2) specify the cache directory via env var, in order to support running capa in an ephemeral environment, such as docker/k8s.
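A minimal sketch of the temp-file-then-move approach and the two environment variables, assuming Python's standard library is enough. The variable names `CAPA_CACHE_DISABLE` and `CAPA_CACHE_DIR` and both function names are made up for illustration; nothing in this thread has settled on them.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional

# hypothetical environment variable names, not an agreed-upon interface
ENV_DISABLE = "CAPA_CACHE_DISABLE"
ENV_DIR = "CAPA_CACHE_DIR"


def get_cache_dir(default: Path) -> Optional[Path]:
    """Return the cache directory, or None when the cache is disabled."""
    if os.environ.get(ENV_DISABLE):
        return None
    return Path(os.environ.get(ENV_DIR) or default)


def write_cache_atomically(dest: Path, data: bytes) -> None:
    """Write to a temp file next to `dest`, then move it into place."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # keep the temp file in the destination directory so the final
    # rename never crosses a filesystem boundary
    fd, tmp = tempfile.mkstemp(dir=str(dest.parent), prefix=dest.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # os.replace overwrites an existing destination; on POSIX the rename is atomic
        os.replace(tmp, dest)
    except BaseException:
        # do not leave a partial temp file behind
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

Because readers only ever see either the old file or the complete new one, two capa processes racing to write the same cache should at worst duplicate work.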
cache file format
object looks like:
cache file identifier/hash

we want to be able to detect when the cache is invalid, such as when the underlying rules have changed or the capa version has changed (and types/import paths may have changed). proposed:
This hash should be validated when loading the cache file. It should also be used to derive the file path of the cache file. This way, there can be multiple versions of capa and its rules on a system, each using a separate cache file. When an error is encountered with the cache file, such as a failure to load, the cache file should be deleted and capa should fall back to non-cached rule loading.
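One way such an identifier could be computed and turned into a file path is sketched below. This is an assumption for illustration, not necessarily the scheme proposed in this comment: the choice of SHA-256, the function names, and the `capa-<hash>.cache` file name are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def compute_cache_id(version: str, rule_contents: Iterable[bytes]) -> str:
    """Hash the capa version together with the content of every rule."""
    h = hashlib.sha256()
    h.update(version.encode("utf-8"))
    h.update(b"\x00")
    # sort so the identifier does not depend on directory traversal order
    for content in sorted(rule_contents):
        # hash each rule on its own first so rule boundaries stay unambiguous
        h.update(hashlib.sha256(content).digest())
    return h.hexdigest()


def cache_path_for(cache_dir: Path, cache_id: str) -> Path:
    """Derive the cache file path from the identifier (file name is hypothetical)."""
    return cache_dir / f"capa-{cache_id}.cache"
```

Since the path includes the identifier, a changed rule or a capa upgrade simply misses the old file instead of loading stale data.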
we'll need to build a tool to generate a cache file into a known location so that we can pre-build the cache file for the standalone binaries. the logic for using the cache file for standalone binaries should be fairly straightforward. building the cache in CI and adding it to the pyinstaller build will take a few minutes but should be pretty straightforward. it's binary data that should be added alongside the rules, just like the rules and signatures are today.
Should we also do this for the signatures/signature analyzer? That part also takes a couple of seconds and rules almost never change. This may be done in
I don’t think this will work for the signatures because the matcher is implemented in Rust and therefore cannot be pickled. I’m still a bit surprised how slow the signature loading is, but in the past when I’ve studied it, I wasn’t able to find any obvious places to fix. Perhaps I should try again to save another few seconds.
it can take a little while for capa to begin analyzing, because it has to load rules, signatures, etc.
we currently have 700+ rules, which i suspect might result in a lot of IO.
we should investigate how much of the loading time can be attributed to rule loading IO, and if the load time can be improved by optimizing the rule loading.
for example, perhaps we could support a zip archive of rules that's read into memory once, versus 700+ seek&read operations.
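As a rough sketch of the zip idea: the archive is read from disk once, and every rule is then pulled out of the in-memory copy. The function name and the `.yml` filter are assumptions for illustration; this does not reflect how capa loads rules today.

```python
import io
import zipfile
from typing import Dict


def read_rules_from_zip(archive: bytes) -> Dict[str, bytes]:
    """Map each rule path inside the archive to its raw content."""
    rules: Dict[str, bytes] = {}
    # the archive is already in memory, so member reads do no further disk IO
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            # skip directories and anything that is not a rule file (assumed .yml)
            if info.is_dir() or not info.filename.endswith(".yml"):
                continue
            rules[info.filename] = zf.read(info)
    return rules
```

Something like `read_rules_from_zip(Path("rules.zip").read_bytes())` would then replace the per-file reads with a single one; whether that actually helps depends on how much of the load time is IO in the first place.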