make string length configurable and consistent across backends #1421

linpeiyu164 · 2023-03-31T17:05:31Z

Checklist

No CHANGELOG update needed

No new tests needed

No documentation update needed

github-actions

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

CHANGELOG updated or no update needed, thanks! 😄

williballenthin

thank you for creating this thorough PR for configuring the minimum string length. it contains a lot of good changes; still, i'd like to ask for some further work to clarify and simplify the code:

i'm not sure that the CLI argument to capa will be often used, so let's remove that until we get a feature request. please modify each feature extractor class to accept a parameter min_str_len with default value DEFAULT_STRING_LENGTH. setting this constant can be our single place of configuration for now.
i'm wary of having to change the function signatures for all of the feature extractor callbacks to accept a kwargs dict. would you investigate passing the string configuration via the ctx dictionary instead? this strategy is a little less explicit, but since only one of many extractors uses the kwarg, it makes most of the code less noisy.
i've made some suggestions about variable names. please apply the changes throughout.

please reach out if you have any questions - i'd hate for you to do any extra work due to me being unclear. thanks for your effort so far!
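
A minimal sketch of the approach requested in this review, using simplified stand-ins rather than capa's real classes. `FunctionHandle` and `SketchFeatureExtractor` below are hypothetical, and the value `4` for `DEFAULT_STRING_LENGTH` is an assumption taken from the hard-coded `len(user_string) >= 4` check in the dnfile diff in this thread. The extractor accepts `min_str_len` with a default and forwards it through `ctx`, so only the string callback has to read it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List

# assumption: 4 mirrors the hard-coded `len(user_string) >= 4` check in the diff
DEFAULT_STRING_LENGTH = 4


@dataclass
class FunctionHandle:
    # hypothetical, simplified stand-in for capa's FunctionHandle
    address: int
    inner: Any
    ctx: Dict[str, Any] = field(default_factory=dict)


class SketchFeatureExtractor:
    # hypothetical extractor; `functions` maps an address to the strings a function references
    def __init__(self, functions: Dict[int, List[str]], min_str_len: int = DEFAULT_STRING_LENGTH):
        self.functions = functions
        self.min_str_len = min_str_len

    def get_functions(self) -> Iterator[FunctionHandle]:
        for address, strings in self.functions.items():
            # the configuration travels in ctx, so callback signatures stay unchanged
            yield FunctionHandle(address=address, inner=strings, ctx={"min_str_len": self.min_str_len})


def extract_function_strings(fh: FunctionHandle) -> Iterator[str]:
    # the only callback that reads the setting
    for s in fh.inner:
        if len(s) >= fh.ctx["min_str_len"]:
            yield s
```

Callers that never pass `min_str_len` get the default, so the constant remains the single place of configuration.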

capa/features/extractors/common.py

capa/features/extractors/dnfile/extractor.py

williballenthin · 2023-04-03T10:24:51Z

capa/features/extractors/dnfile/extractor.py

+    def __init__(self, path: str, len: int):
        super().__init__()
        self.pe: dnfile.dnPE = dnfile.dnPE(path)
+        self.len = len


let's call this min_str_len since it's not obvious from the name len that it's used by the string extractor.

Changed all names to min_len as stated in a previous comment.

williballenthin · 2023-04-03T10:25:15Z

capa/features/extractors/dnfile/extractor.py

@@ -98,7 +99,7 @@ def get_functions(self) -> Iterator[FunctionHandle]:
            fh: FunctionHandle = FunctionHandle(
                address=DNTokenAddress(token),
                inner=method,
-                ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache},
+                ctx={"pe": self.pe, "calls_from": set(), "calls_to": set(), "cache": self.token_cache, "len": self.len},


update this name, too, please.

incidentally, this is a great way to pass the string length configuration to the extractors. could we use this strategy more consistently to avoid having to update all the extract_* callback function signatures?

I created a file_ctx dict to pass into extract_file_features functions. This decreases the number of kwargs that need to be added to the function signatures.
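
A rough sketch of the `file_ctx` idea described in this reply, assuming simplified handlers that are not capa's real ones. The handler names, the `"sections"` and `"strings"` keys, and the `(kind, value)` tuples are hypothetical. Every file handler takes the same single dict, and only the string handler looks up `min_len`.

```python
from typing import Any, Dict, Iterator, Tuple

FileCtx = Dict[str, Any]


def extract_file_section_names(file_ctx: FileCtx) -> Iterator[Tuple[str, str]]:
    # hypothetical handler that ignores the string configuration entirely
    for name in file_ctx["sections"]:
        yield "section", name


def extract_file_strings(file_ctx: FileCtx) -> Iterator[Tuple[str, str]]:
    # hypothetical handler: the only one that reads min_len from the dict
    for s in file_ctx["strings"]:
        if len(s) >= file_ctx["min_len"]:
            yield "string", s


FILE_HANDLERS = (extract_file_section_names, extract_file_strings)


def extract_file_features(file_ctx: FileCtx) -> Iterator[Tuple[str, str]]:
    # one uniform call shape, so no handler needs extra kwargs
    for file_handler in FILE_HANDLERS:
        yield from file_handler(file_ctx)
```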

williballenthin · 2023-04-03T10:25:48Z

capa/features/extractors/dnfile/file.py

    yield from capa.features.extractors.dotnetfile.extract_file_function_names(pe=pe)


-def extract_file_strings(pe: dnfile.dnPE) -> Iterator[Tuple[String, Address]]:
-    yield from capa.features.extractors.dotnetfile.extract_file_strings(pe=pe)
+def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]:


Suggested change

def extract_file_strings(pe: dnfile.dnPE, len: int) -> Iterator[Tuple[String, Address]]:

def extract_file_strings(pe: dnfile.dnPE, min_len: int) -> Iterator[Tuple[String, Address]]:

capa/features/extractors/elffile.py

capa/features/extractors/strings.py

capa/features/extractors/viv/extractor.py

williballenthin · 2023-04-03T10:30:46Z

capa/features/extractors/viv/extractor.py

        with open(self.path, "rb") as f:
            self.buf = f.read()

        # pre-compute these because we'll yield them at *every* scope.
        self.global_features: List[Tuple[Feature, Address]] = []
-        self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf))
+        self.global_features.extend(capa.features.extractors.viv.file.extract_file_format(self.buf, self.len))


why does extracting the file format need the string length configuration? can we remove this kwarg?

williballenthin · 2023-04-03T10:31:01Z

capa/features/extractors/viv/file.py

@@ -91,7 +91,7 @@ def extract_file_function_names(vw, **kwargs) -> Iterator[Tuple[Feature, Address
                yield FunctionName(name[1:]), addr


-def extract_file_format(buf, **kwargs) -> Iterator[Tuple[Feature, Address]]:
+def extract_file_format(buf, len: int=len, **kwargs) -> Iterator[Tuple[Feature, Address]]:


this kwarg is unused?

Replaced argument with file_ctx.

williballenthin · 2023-04-05T12:08:41Z

capa/features/extractors/common.py

-def extract_file_strings(buf, **kwargs) -> Iterator[Tuple[String, Address]]:
+def extract_file_strings(buf, min_len, **kwargs) -> Iterator[Tuple[String, Address]]:
    """
    extract ASCII and UTF-16 LE strings from file
    """
-    for s in capa.features.extractors.strings.extract_ascii_strings(buf):
+    for s in capa.features.extractors.strings.extract_ascii_strings(buf, min_len=min_len):
        yield String(s.s), FileOffsetAddress(s.offset)

-    for s in capa.features.extractors.strings.extract_unicode_strings(buf):
+    for s in capa.features.extractors.strings.extract_unicode_strings(buf, min_len=min_len):
        yield String(s.s), FileOffsetAddress(s.offset)
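
For readers unfamiliar with what `min_len` controls, here is a self-contained sketch of minimum-length string extraction. It is not the code in `capa/features/extractors/strings.py`: the printable-ASCII character class, the default of `4`, and the `(string, offset)` return shape are all assumptions made for illustration.

```python
import re
from typing import Iterator, Tuple

# assumption: printable ASCII only; the real character class may differ
ASCII_BYTE = rb"[\x20-\x7e]"


def extract_ascii_strings(buf: bytes, min_len: int = 4) -> Iterator[Tuple[str, int]]:
    # match runs of at least min_len printable bytes
    pattern = re.compile(rb"(?:%s){%d,}" % (ASCII_BYTE, min_len))
    for m in pattern.finditer(buf):
        yield m.group().decode("ascii"), m.start()


def extract_unicode_strings(buf: bytes, min_len: int = 4) -> Iterator[Tuple[str, int]]:
    # UTF-16 LE: each printable byte is followed by a NUL byte
    pattern = re.compile(rb"(?:%s\x00){%d,}" % (ASCII_BYTE, min_len))
    for m in pattern.finditer(buf):
        yield m.group().decode("utf-16-le"), m.start()
```

Passing the same `min_len` to both functions is what keeps the ASCII and UTF-16 LE results consistent with each other.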


williballenthin · 2023-04-05T12:08:50Z

capa/features/extractors/dnfile/extractor.py

@@ -89,7 +91,9 @@ def extract_global_features(self):
        yield from self.global_features

    def extract_file_features(self):
-        yield from capa.features.extractors.dnfile.file.extract_features(self.pe)
+        yield from capa.features.extractors.dnfile.file.extract_features(
+            file_ctx={"pe": self.pe, "min_len": self.min_len}


Suggested change

file_ctx={"pe": self.pe, "min_len": self.min_len}

ctx={"pe": self.pe, "min_str_len": self.min_str_len}

given that we're invoking the "file" extractor, it's not necessary to include the term file_ in the parameter name.

while this isn't wrong, I'd just like to keep the code as concise and consistent as possible.

williballenthin · 2023-04-05T12:09:17Z

capa/features/extractors/dnfile/file.py

@@ -18,37 +18,37 @@
 from capa.features.address import Address


-def extract_file_import_names(pe: dnfile.dnPE) -> Iterator[Tuple[Import, Address]]:
-    yield from capa.features.extractors.dotnetfile.extract_file_import_names(pe=pe)
+def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]:


Suggested change

def extract_file_import_names(file_ctx) -> Iterator[Tuple[Import, Address]]:

def extract_file_import_names(ctx) -> Iterator[Tuple[Import, Address]]:

williballenthin · 2023-04-05T12:09:51Z

capa/features/extractors/dnfile/extractor.py

+                    "calls_from": set(),
+                    "calls_to": set(),
+                    "cache": self.token_cache,
+                    "min_len": self.min_len,


Suggested change

"min_len": self.min_len,

"min_str_len": self.min_len,

williballenthin · 2023-04-05T12:10:04Z

capa/features/extractors/dnfile/extractor.py

@@ -68,9 +69,10 @@ def get_type(self, token: int) -> Optional[Union[DnType, DnUnmanagedMethod]]:


 class DnfileFeatureExtractor(FeatureExtractor):
-    def __init__(self, path: str):
+    def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH):


Suggested change

def __init__(self, path: str, min_len: int = DEFAULT_STRING_LENGTH):

def __init__(self, path: str, min_str_len: int = DEFAULT_STRING_LENGTH):

there's not enough context here to infer "min_len" of what, so let's include "str" in the variable/property names.

Since dnfile/file.py uses functions from dotnetfile.py, I changed extract_features in dnfile/file.py to alter the ctx variable for the dotnetfile functions ( ctx = {"min_len": ctx["min_str_len"]} ). I didn't change the other functions in dnfile/file.py to make them more concise. But I'm not sure if the inconsistency would be confusing?

williballenthin · 2023-04-05T12:10:14Z

capa/features/extractors/dnfile/extractor.py

        super().__init__()
        self.pe: dnfile.dnPE = dnfile.dnPE(path)
+        self.min_len = min_len


Suggested change

self.min_len = min_len

self.min_str_len = min_str_len

williballenthin · 2023-04-05T12:12:13Z

capa/features/extractors/dnfile/insn.py

@@ -191,7 +191,7 @@ def extract_insn_string_features(fh: FunctionHandle, bh, ih: InsnHandle) -> Iter
    if user_string is None:
        return

-    if len(user_string) >= 4:
+    if len(user_string) >= fh.ctx["min_len"]:


Suggested change

if len(user_string) >= fh.ctx["min_len"]:

if len(user_string) >= fh.ctx["min_str_len"]:

williballenthin · 2023-04-05T12:13:36Z

capa/features/extractors/ida/extractor.py

-    @staticmethod
-    def get_function(ea: int) -> FunctionHandle:
+    def get_function(self, ea: int) -> FunctionHandle:
        f = idaapi.get_func(ea)
-        return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f)
+        return FunctionHandle(address=AbsoluteVirtualAddress(f.start_ea), inner=f, ctx={"min_len": self.min_len})


good change

williballenthin · 2023-05-10T11:46:26Z

@linpeiyu164 do you expect to be able to resume work on this PR and/or shall we lend a hand to close it out?

linpeiyu164 · 2023-05-10T13:35:19Z

I have made the requested changes from the last review. I didn't notice the merge conflict but I just resolved it.
There seems to be a failing check, but I think the PR will be done once I deal with it.
Is there anything else I should change?

linpeiyu164 · 2023-05-10T15:35:46Z

The mypy check was not triggered to run again so the results can't be seen here, but it ran successfully when I tested it locally:
$ mypy --config-file .github/mypy/mypy.ini --check-untyped-defs capa/features/
Success: no issues found in 53 source files

mr-tz · 2023-05-11T06:57:03Z

capa/features/extractors/dnfile/file.py

    for file_handler in FILE_HANDLERS:
-        for feature, address in file_handler(pe):
+        for feature, address in file_handler(ctx={"pe": ctx["pe"], "min_len": ctx["min_str_len"]}):


should "min_len" be "min_str_len" as well here?

I think I misunderstood the request before. I thought only the dotnetfile had to be changed to min_str_len and was really confused. I have changed them all except for those in common.py and strings.py.

mr-tz · 2023-05-11T07:02:21Z

capa/features/extractors/ida/file.py

@@ -134,7 +134,7 @@ def extract_file_section_names() -> Iterator[Tuple[Feature, Address]]:
        yield Section(idaapi.get_segm_name(seg)), AbsoluteVirtualAddress(seg.start_ea)


-def extract_file_strings() -> Iterator[Tuple[Feature, Address]]:
+def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]:


Suggested change

def extract_file_strings(ctx=None) -> Iterator[Tuple[Feature, Address]]:

def extract_file_strings(ctx) -> Iterator[Tuple[Feature, Address]]:

mypy gives an error if I don't add a default value for ctx here but I am not sure why.
I think it is related to this issue: python/mypy#9527
capa/features/extractors/ida/file.py:189: error: Cannot call function of unknown type [operator]
Found 1 error in 1 file (checked 53 source files)
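
One possible way to sidestep this kind of error, offered only as a sketch: give the handler tuple an explicit `Callable` element type, so the type checker does not have to infer a common type from handlers with differing signatures. Whether this resolves the specific error at `ida/file.py:189` is not confirmed anywhere in this thread. The `FileHandler` alias and the handler bodies below are hypothetical.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple

Ctx = Dict[str, Any]
Feature = Tuple[str, str]
# explicit element type for the handler tuple (hypothetical alias)
FileHandler = Callable[[Ctx], Iterator[Feature]]


def extract_file_format(ctx: Optional[Ctx] = None) -> Iterator[Feature]:
    # hypothetical handler with a defaulted parameter; still callable with one ctx argument
    yield "format", "pe"


def extract_file_strings(ctx: Ctx) -> Iterator[Feature]:
    # hypothetical handler that requires ctx
    for s in ctx["strings"]:
        if len(s) >= ctx["min_str_len"]:
            yield "string", s


FILE_HANDLERS: Tuple[FileHandler, ...] = (extract_file_format, extract_file_strings)


def extract_features(ctx: Ctx) -> Iterator[Feature]:
    for file_handler in FILE_HANDLERS:
        yield from file_handler(ctx)
```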

mr-tz · 2023-05-11T07:11:09Z

capa/features/extractors/pefile.py

@@ -87,8 +88,8 @@ def extract_file_section_names(pe, **kwargs):
        yield Section(name), AbsoluteVirtualAddress(base_address + section.VirtualAddress)


-def extract_file_strings(buf, **kwargs):


weren't the kwargs meant to let us pass {"min_len": ...} here without having to change all the other signatures?

I thought the point of using ctx was that kwargs doesn't have to be added to all the function signatures.
Since this loop requires that all handlers share the same function signature, I think it is easier to change this function than to add kwargs to all the other ones (because of the buf argument).

    for file_handler in FILE_HANDLERS:
        # file_handler: type: (pe, bytes) -> Iterable[Tuple[Feature, Address]]
        for feature, va in file_handler(ctx=ctx):  # type: ignore
            yield feature, va


FILE_HANDLERS = (
    extract_file_embedded_pe,
    extract_file_export_names,
    extract_file_import_names,
    extract_file_section_names,
    extract_file_strings,
    extract_file_function_names,
    extract_file_format,
)

mr-tz · 2024-03-22T10:11:39Z

closing due to inactivity and outdated code, will mark the issue as open again

make string length configurable and consistent across backends

e86a06b

github-actions bot previously requested changes Mar 31, 2023

linpeiyu164 and others added 5 commits April 1, 2023 12:07

make string length configurable and consistent across backends

ad87e84

Merge branch 'issue1303' of github.com:linpeiyu164/capa into issue1303

9f71ac4

fix argument bug

3c5d538

Merge branch 'mandiant:master' into issue1303

7613e83

fix argument bug

dc47ffa

linpeiyu164 marked this pull request as draft April 1, 2023 14:52

fix bugs and pass tests

e2a30bc

williballenthin requested changes Apr 3, 2023

linpeiyu164 added 3 commits April 4, 2023 00:57

change variable names and add file_ctx

7934890

change variable names and add file_ctx

dfd8641

change variable names and add file_ctx

1fcfe62

linpeiyu164 marked this pull request as ready for review April 3, 2023 19:17

mr-tz added the "dont merge" label (Indicate a PR that is still being worked on) Apr 4, 2023

williballenthin requested changes Apr 5, 2023

change file_ctx to ctx. change min_len to min_str_len for dnfile

021ede8

Merge branch 'master' into issue1303

18f3135

linpeiyu164 added 3 commits May 10, 2023 22:23

add missing argument

b912804

fix mypy check fail

01b322a

fix mypy check fail

5520761

mr-tz reviewed May 11, 2023

linpeiyu164 added 2 commits May 11, 2023 18:00

change to min_str_len

b6d3c44

change to min_str_len

8167686

williballenthin marked this pull request as draft July 13, 2023 08:46

williballenthin mentioned this pull request Sep 25, 2023

make string length configurable and consistent across backends #1303

Open

mr-tz closed this Mar 22, 2024
