-
Notifications
You must be signed in to change notification settings - Fork 578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make string length configurable and consistent across backends #1421
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased)
section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed
CHANGELOG updated or no update needed, thanks! 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for creating this thorough PR for configurating the minimum string length. it contains a lot of good changes; still, i'd like to ask for some further work to clarify and simplify the code:
- i'm not sure that the CLI argument to capa will be often used, so let's remove that until we get a feature request. please modify each feature extractor class to accept a parameter
min_str_len
with default valueDEFAULT_STRING_LENGTH
. setting this constant can be our single place of configuration for now. - im wary of having to change the function signatures for all of the feature extractor callbacks to accept a kwargs dict. would you investigate passing the string configuration via the ctx dictionary instead? this strategy is a little less explicit, but since only one of many extractors uses the kwarg, it makes most of the code less noisy.
- i've made some suggestions about variable names. please apply the changes throughout.
please reach out if you have any questions - i'd hate for you to do any extra work due to me being unclear. thanks for your effort so far!
def __init__(self, path: str, len: int): | ||
super().__init__() | ||
self.pe: dnfile.dnPE = dnfile.dnPE(path) | ||
self.len = len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets call this min_str_len
since its not obvious from the name len
that its used by the string extractor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed all names to min_len
as stated in a previous comment.
@@ -98,7 +99,7 @@ def get_functions(self) -> Iterator[FunctionHandle]: | |||
fh: FunctionHandle = FunctionHandle( | |||
address=DNTokenAddress(token), | |||
inner=method, | |||
ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache}, | |||
ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache, "len": self.len}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update this name, too, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
incidentally, this is a great way to pass the string length configuration to the extractors. could we use this strategy more consistently to avoid having to update all the extract_*
callback function signatures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a file_ctx
dict to pass into extract_file_features
functions. This decreases the number of kwargs that need to be added to the function signatures.
yield from capa.features.extractors.dotnetfile.extract_file_function_names(pe=pe) | ||
|
||
|
||
def extract_file_strings(pe: dnfile.dnPE) -> Iterator[Tuple[String, Address]]: | ||
yield from capa.features.extractors.dotnetfile.extract_file_strings(pe=pe) | ||
def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]: | |
def extract_file_strings(pe: dnfile.dnPE, min_len: int) -> Iterator[Tuple[String, Address]]: |
with open(self.path, "rb") as f: | ||
self.buf = f.read() | ||
|
||
# pre-compute these because we'll yield them at *every* scope. | ||
self.global_features: List[Tuple[Feature, Address]] = [] | ||
self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf)) | ||
self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf, self.len)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does extracting the file format need the string length configuration? can we remove this kwarg?
capa/features/extractors/viv/file.py
Outdated
@@ -91,7 +91,7 @@ def extract_file_function_names(vw, **kwargs) -> Iterator[Tuple[Feature, Address | |||
yield FunctionName(name[1:]), addr | |||
|
|||
|
|||
def extract_file_format(buf, **kwargs) -> Iterator[Tuple[Feature, Address]]: | |||
def extract_file_format(buf, len: int=len, **kwargs) -> Iterator[Tuple[Feature, Address]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this kwarg is unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced argument with file_ctx
.
def extract_file_strings(buf, **kwargs) -> Iterator[Tuple[String, Address]]: | ||
def extract_file_strings(buf, min_len, **kwargs) -> Iterator[Tuple[String, Address]]: | ||
""" | ||
extract ASCII and UTF-16 LE strings from file | ||
""" | ||
for s in capa.features.extractors.strings.extract_ascii_strings(buf): | ||
for s in capa.features.extractors.strings.extract_ascii_strings(buf, min_len=min_len): | ||
yield String(s.s), FileOffsetAddress(s.offset) | ||
|
||
for s in capa.features.extractors.strings.extract_unicode_strings(buf): | ||
for s in capa.features.extractors.strings.extract_unicode_strings(buf, min_len=min_len): | ||
yield String(s.s), FileOffsetAddress(s.offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great
@@ -89,7 +91,9 @@ def extract_global_features(self): | |||
yield from self.global_features | |||
|
|||
def extract_file_features(self): | |||
yield from capa.features.extractors.dnfile.file.extract_features(self.pe) | |||
yield from capa.features.extractors.dnfile.file.extract_features( | |||
file_ctx={"pe": self.pe, "min_len": self.min_len} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file_ctx={"pe": self.pe, "min_len": self.min_len} | |
ctx={"pe": self.pe, "min_str_len": self.min_str_len} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that we're invoking the "file" extractor, its not necessary to include the term file_
in the parameter name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while this isn't wrong, I'd just like to keep the code as concise and consistent as possible.
@@ -18,37 +18,37 @@ | |||
from capa.features.address import Address | |||
|
|||
|
|||
def extract_file_import_names(pe: dnfile.dnPE) -> Iterator[Tuple[Import, Address]]: | |||
yield from capa.features.extractors.dotnetfile.extract_file_import_names(pe=pe) | |||
def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]: | |
def extract_file_import_names(ctx) -> Iterator[Tuple[Import, Address]]: |
"calls_from": set(), | ||
"calls_to": set(), | ||
"cache": self.token_cache, | ||
"min_len": self.min_len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"min_len": self.min_len, | |
"min_str_len": self.min_len, |
@@ -68,9 +69,10 @@ def get_type(self, token: int) -> Optional[Union[DnType, DnUnmanagedMethod]]: | |||
|
|||
|
|||
class DnfileFeatureExtractor(FeatureExtractor): | |||
def __init__(self, path: str): | |||
def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH): | |
def __init__(self, path: str, min_str_len: int = DEFAULT_STRING_LENGTH): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's not enough context here to infer "min_len" of what? so lets include "str" in the variable/property names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since dnfile/file.py uses functions from dotnetfile.py, I changed extract_features
in dnfile/file.py to alter the ctx
variable for the dotnetfile functions ( ctx = {"min_len": ctx["min_str_len"]}
). I didn't change the other functions in dnfile/file.py to make them more concise. But I'm not sure if the inconsistency would be confusing?
super().__init__() | ||
self.pe: dnfile.dnPE = dnfile.dnPE(path) | ||
self.min_len = min_len |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.min_len = min_len | |
self.min_str_len = min_str_len |
@@ -191,7 +191,7 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter | |||
if user_string is None: | |||
return | |||
|
|||
if len(user_string) >= 4: | |||
if len(user_string) >= fh.ctx["min_len"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if len(user_string) >= fh.ctx["min_len"]: | |
if len(user_string) >= fh.ctx["min_str_len"]: |
@staticmethod | ||
def get_function(ea: int) -> FunctionHandle: | ||
def get_function(self, ea: int) -> FunctionHandle: | ||
f = idaapi.get_func(ea) | ||
return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f) | ||
return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f, ctx={"min_len": self.min_len}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good change
@linpeiyu164 do you expect to be able to resume work on this PR and/or shall we lend a hand to close it out? |
I have made the requested changes from the last review. I didn't notice the merge conflict but I just resolved it. |
The mypy check was not triggered to run again so the results can't be seen here, but it ran successfully when I tested it locally: |
for file_handler in FILE_HANDLERS: | ||
for feature, address in file_handler(pe): | ||
for feature, address in file_handler(ctx={"pe": ctx["pe"], "min_len": ctx["min_str_len"]}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should "min_len" be "min_str_len" as well here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I misunderstood the request before. I thought only the dotnetfile had to be changed to min_str_len and was really confused. I have changed them all except for those in common.py and strings.py.
@@ -134,7 +134,7 @@ def extract_file_section_names() -> Iterator[Tuple[Feature, Address]]: | |||
yield Section(idaapi.get_segm_name(seg)), AbsoluteVirtualAddress(seg.start_ea) | |||
|
|||
|
|||
def extract_file_strings() -> Iterator[Tuple[Feature, Address]]: | |||
def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]: | |
def extract_file_strings(ctx) -> Iterator[Tuple[Feature, Address]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mypy gives an error if I don't add a default value for ctx here but I am not sure why.
I think it is related to this issue: python/mypy#9527
capa/features/extractors/ida/file.py:189: error: Cannot call function of unknown type [operator]
Found 1 error in 1 file (checked 53 source files)
@@ -87,8 +88,8 @@ def extract_file_section_names(pe, **kwargs): | |||
yield Section(name), AbsoluteVirtualAddress(base_address + section.VirtualAddress) | |||
|
|||
|
|||
def extract_file_strings(buf, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weren't the kwargs meant so that we could pass {"min_len": ...}
as such here and don't have to change all other signatures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought that using ctx was so that kwargs doesn't have to be added to all the function signatures.
Since this loop requires that they have the same function signatures, I think it is easier to change this function instead of adding kwargs to all the other ones(because of the buf argument)
for file_handler in FILE_HANDLERS:
# file_handler: type: (pe, bytes) -> Iterable[Tuple[Feature, Address]]
for feature, va in file_handler(ctx=ctx): # type: ignore
yield feature, va
FILE_HANDLERS = (
extract_file_embedded_pe,
extract_file_export_names,
extract_file_import_names,
extract_file_section_names,
extract_file_strings,
extract_file_function_names,
extract_file_format,
)
closing due to inactivity and outdated code, will mark the issue as open again |
closes #1303
Checklist