Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make string length configurable and consistent across backends #1421

Closed
wants to merge 17 commits into from

Conversation

linpeiyu164
Copy link
Contributor

@linpeiyu164 linpeiyu164 commented Mar 31, 2023

closes #1303

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@linpeiyu164 linpeiyu164 marked this pull request as draft April 1, 2023 14:52
@github-actions github-actions bot dismissed their stale review April 1, 2023 15:08

CHANGELOG updated or no update needed, thanks! 😄

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for creating this thorough PR for configurating the minimum string length. it contains a lot of good changes; still, i'd like to ask for some further work to clarify and simplify the code:

  1. i'm not sure that the CLI argument to capa will be often used, so let's remove that until we get a feature request. please modify each feature extractor class to accept a parameter min_str_len with default value DEFAULT_STRING_LENGTH. setting this constant can be our single place of configuration for now.
  2. im wary of having to change the function signatures for all of the feature extractor callbacks to accept a kwargs dict. would you investigate passing the string configuration via the ctx dictionary instead? this strategy is a little less explicit, but since only one of many extractors uses the kwarg, it makes most of the code less noisy.
  3. i've made some suggestions about variable names. please apply the changes throughout.

please reach out if you have any questions - i'd hate for you to do any extra work due to me being unclear. thanks for your effort so far!

Comment on lines 71 to 74
def __init__(self, path: str, len: int):
super().__init__()
self.pe: dnfile.dnPE = dnfile.dnPE(path)
self.len = len
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets call this min_str_len since its not obvious from the name len that its used by the string extractor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed all names to min_len as stated in a previous comment.

@@ -98,7 +99,7 @@ def get_functions(self) -> Iterator[FunctionHandle]:
fh: FunctionHandle = FunctionHandle(
address=DNTokenAddress(token),
inner=method,
ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache},
ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache, "len": self.len},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update this name, too, please.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incidentally, this is a great way to pass the string length configuration to the extractors. could we use this strategy more consistently to avoid having to update all the extract_* callback function signatures?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a file_ctx dict to pass into extract_file_features functions. This decreases the number of kwargs that need to be added to the function signatures.

yield from capa.features.extractors.dotnetfile.extract_file_function_names(pe=pe)


def extract_file_strings(pe: dnfile.dnPE) -> Iterator[Tuple[String, Address]]:
yield from capa.features.extractors.dotnetfile.extract_file_strings(pe=pe)
def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]:
def extract_file_strings(pe: dnfile.dnPE, min_len: int) -> Iterator[Tuple[String, Address]]:

with open(self.path, "rb") as f:
self.buf = f.read()

# pre-compute these because we'll yield them at *every* scope.
self.global_features: List[Tuple[Feature, Address]] = []
self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf))
self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf, self.len))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does extracting the file format need the string length configuration? can we remove this kwarg?

@@ -91,7 +91,7 @@ def extract_file_function_names(vw, **kwargs) -> Iterator[Tuple[Feature, Address
yield FunctionName(name[1:]), addr


def extract_file_format(buf, **kwargs) -> Iterator[Tuple[Feature, Address]]:
def extract_file_format(buf, len: int=len, **kwargs) -> Iterator[Tuple[Feature, Address]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this kwarg is unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced argument with file_ctx.

@linpeiyu164 linpeiyu164 marked this pull request as ready for review April 3, 2023 19:17
@mr-tz mr-tz added the dont merge Indicate a PR that is still being worked on label Apr 4, 2023
Comment on lines -39 to 47
def extract_file_strings(buf, **kwargs) -> Iterator[Tuple[String, Address]]:
def extract_file_strings(buf, min_len, **kwargs) -> Iterator[Tuple[String, Address]]:
"""
extract ASCII and UTF-16 LE strings from file
"""
for s in capa.features.extractors.strings.extract_ascii_strings(buf):
for s in capa.features.extractors.strings.extract_ascii_strings(buf, min_len=min_len):
yield String(s.s), FileOffsetAddress(s.offset)

for s in capa.features.extractors.strings.extract_unicode_strings(buf):
for s in capa.features.extractors.strings.extract_unicode_strings(buf, min_len=min_len):
yield String(s.s), FileOffsetAddress(s.offset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great

@@ -89,7 +91,9 @@ def extract_global_features(self):
yield from self.global_features

def extract_file_features(self):
yield from capa.features.extractors.dnfile.file.extract_features(self.pe)
yield from capa.features.extractors.dnfile.file.extract_features(
file_ctx={"pe": self.pe, "min_len": self.min_len}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
file_ctx={"pe": self.pe, "min_len": self.min_len}
ctx={"pe": self.pe, "min_str_len": self.min_str_len}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that we're invoking the "file" extractor, its not necessary to include the term file_ in the parameter name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while this isn't wrong, I'd just like to keep the code as concise and consistent as possible.

@@ -18,37 +18,37 @@
from capa.features.address import Address


def extract_file_import_names(pe: dnfile.dnPE) -> Iterator[Tuple[Import, Address]]:
yield from capa.features.extractors.dotnetfile.extract_file_import_names(pe=pe)
def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]:
def extract_file_import_names(ctx) -> Iterator[Tuple[Import, Address]]:

"calls_from": set(),
"calls_to": set(),
"cache": self.token_cache,
"min_len": self.min_len,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"min_len": self.min_len,
"min_str_len": self.min_len,

@@ -68,9 +69,10 @@ def get_type(self, token: int) -> Optional[Union[DnType, DnUnmanagedMethod]]:


class DnfileFeatureExtractor(FeatureExtractor):
def __init__(self, path: str):
def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH):
def __init__(self, path: str, min_str_len: int = DEFAULT_STRING_LENGTH):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's not enough context here to infer "min_len" of what? so lets include "str" in the variable/property names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since dnfile/file.py uses functions from dotnetfile.py, I changed extract_features in dnfile/file.py to alter the ctx variable for the dotnetfile functions ( ctx = {"min_len": ctx["min_str_len"]} ). I didn't change the other functions in dnfile/file.py to make them more concise. But I'm not sure if the inconsistency would be confusing?

super().__init__()
self.pe: dnfile.dnPE = dnfile.dnPE(path)
self.min_len = min_len
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self.min_len = min_len
self.min_str_len = min_str_len

@@ -191,7 +191,7 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter
if user_string is None:
return

if len(user_string) >= 4:
if len(user_string) >= fh.ctx["min_len"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(user_string) >= fh.ctx["min_len"]:
if len(user_string) >= fh.ctx["min_str_len"]:

Comment on lines 47 to 51
@staticmethod
def get_function(ea: int) -> FunctionHandle:
def get_function(self, ea: int) -> FunctionHandle:
f = idaapi.get_func(ea)
return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f)
return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f, ctx={"min_len": self.min_len})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good change

@williballenthin
Copy link
Collaborator

@linpeiyu164 do you expect to be able to resume work on this PR and/or shall we lend a hand to close it out?

@linpeiyu164
Copy link
Contributor Author

I have made the requested changes from the last review. I didn't notice the merge conflict but I just resolved it.
There seems to be a failing check, but I think the PR will be done once I deal with it.
Is there anything else I should change?

@linpeiyu164
Copy link
Contributor Author

The mypy check was not triggered to run again so the results can't be seen here, but it ran successfully when I tested it locally:
$ mypy --config-file .github/mypy/mypy.ini --check-untyped-defs capa/features/
Success: no issues found in 53 source files

for file_handler in FILE_HANDLERS:
for feature, address in file_handler(pe):
for feature, address in file_handler(ctx={"pe": ctx["pe"], "min_len": ctx["min_str_len"]}):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should "min_len" be "min_str_len" as well here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood the request before. I thought only the dotnetfile had to be changed to min_str_len and was really confused. I have changed them all except for those in common.py and strings.py.

@@ -134,7 +134,7 @@ def extract_file_section_names() -> Iterator[Tuple[Feature, Address]]:
yield Section(idaapi.get_segm_name(seg)), AbsoluteVirtualAddress(seg.start_ea)


def extract_file_strings() -> Iterator[Tuple[Feature, Address]]:
def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]:
def extract_file_strings(ctx) -> Iterator[Tuple[Feature, Address]]:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mypy gives an error if I don't add a default value for ctx here but I am not sure why.
I think it is related to this issue: python/mypy#9527
capa/features/extractors/ida/file.py:189: error: Cannot call function of unknown type [operator]
Found 1 error in 1 file (checked 53 source files)

@@ -87,8 +88,8 @@ def extract_file_section_names(pe, **kwargs):
yield Section(name), AbsoluteVirtualAddress(base_address + section.VirtualAddress)


def extract_file_strings(buf, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weren't the kwargs meant so that we could pass {"min_len": ...} as such here and don't have to change all other signatures?

Copy link
Contributor Author

@linpeiyu164 linpeiyu164 May 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that using ctx was so that kwargs doesn't have to be added to all the function signatures.
Since this loop requires that they have the same function signatures, I think it is easier to change this function instead of adding kwargs to all the other ones(because of the buf argument)

    for file_handler in FILE_HANDLERS:
        # file_handler: type: (pe, bytes) -> Iterable[Tuple[Feature, Address]]
        for feature, va in file_handler(ctx=ctx):  # type: ignore
            yield feature, va


FILE_HANDLERS = (
    extract_file_embedded_pe,
    extract_file_export_names,
    extract_file_import_names,
    extract_file_section_names,
    extract_file_strings,
    extract_file_function_names,
    extract_file_format,
)

@mr-tz
Copy link
Collaborator

mr-tz commented Mar 22, 2024

closing due to inactivity and outdated code, will mark the issue as open again

@mr-tz mr-tz closed this Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont merge Indicate a PR that is still being worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

make string length configurable and consistent across backends
3 participants