A way to have optional strings in the rule as "0 of" behavior changed in v4.2.3 #1937

Closed
psrok1 opened this issue Jul 27, 2023 · 10 comments


@psrok1

psrok1 commented Jul 27, 2023

Is your feature request related to a problem? Please describe.

We noticed that commit 7a99e6d, released in v4.2.3, changed the behavior of 0 of (...), which broke some of our Yara rules:

rule [redacted]
{
    strings:
        [redacted]
    condition:
        ( 2 of ($op_*) )
        and 1 of ($str_encoded_*)
        and 0 of ($set_new_cookie, $magic*)  // for callback only
}

0 of was used in our rules to overcome the fact that Yara doesn't allow matching strings that are not referenced in the condition. In our case we wanted to have "optional" strings that don't affect the match result, but are still looked up so we can get their offsets and extract additional information.

Previously, x of (...) meant "match if at least x of the specified strings are found", so 0 of was always satisfied and fit that original semantics well.

I understand that it became confusing when the none keyword arrived, because none of (...) was then expected to behave the same as 0 of (...), as noted in #1695.
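
To make the difference concrete, here is an illustrative sketch (not taken from our ruleset; the rule name and string contents are made up). Under the old "at least x" reading, the condition below is always true, so the rule matches any file while $magic_1 and $magic_2 are still searched for. With the special case introduced in v4.2.3, 0 of reads like none of, so the rule matches only files that contain neither string.

```yara
rule zero_of_example
{
    strings:
        // Hypothetical strings, for illustration only
        $magic_1 = "MAGIC1"
        $magic_2 = "MAGIC2"
    condition:
        // Before v4.2.3: "at least 0 of these", always true
        // From v4.2.3:   treated like "none of ($magic*)"
        0 of ($magic*)
}
```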

Describe the solution you'd like

It would be nice to have another way to indicate that some strings are intentionally unreferenced in the condition but should still be matched. Right now we're doing "staged matching" in multiple places, spawning another Yara match with the optional strings and an any of them condition.

Possible solutions I would like are:

  • different behavior of none of and 0 of, leaving the original 0 of meaning
  • string modifier that marks string as intentionally unreferenced in the condition

Let me know what you think about it!

@plusvic
Member

plusvic commented Jul 28, 2023

Can you elaborate on the concept of "staged matching"? I understand you are interested in those unreferenced strings in the rule, but it is not clear to me whether you are interested in them only when the rest of the rule matches, only when it doesn't match, or in both cases.

@psrok1
Author

psrok1 commented Jul 28, 2023

Yes, we're interested in them only when the rest of the rule matches. By "staged matching" I mean:

  • a main rule for malware classification (strings referenced in the condition) that must pass to classify malware as a specific family
  • a second rule used once the main rule has matched. We use it to look for additional strings that let us gather extra information but are not crucial for proper classification. From a classification perspective, we can't include them in the main rule with an or/1 of clause because they're not specific enough for general matching (we might get false positives), but it's also fine if they don't match at all (we then get only part of the information, which may indicate that part of the rule needs fixing).

In some cases we decided that it would be great to get the complete information from a single rule in one pass, which is why the 0 of hack appeared in our ruleset 😄 This is especially true when these strings are still fast enough for general matching.
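
As a rough sketch of the staged setup described above (rule names and string contents are invented for illustration; the real rules are redacted), the two stages could look like this. The second rule would be compiled and run separately, and only on samples that the first rule already classified.

```yara
// Stage 1: classification rule. Every string is referenced in the condition.
rule family_main
{
    strings:
        $op_1 = { 8B 45 ?? 33 C1 }
        $op_2 = { C1 E8 05 }
        $str_encoded_1 = "ZXhhbXBsZQ=="
    condition:
        2 of ($op_*) and 1 of ($str_encoded_*)
}

// Stage 2: helper rule, run only after family_main has matched.
// Its strings are too generic for detection but useful for extraction.
rule family_extras
{
    strings:
        $set_new_cookie = "Set-Cookie:"
        $magic_1 = { DE AD BE EF }
    condition:
        any of them
}
```

If nothing in the second rule matches, the classification from the first stage still stands; only the extra information is missing.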

@msm-code

To give a more concrete example: we write "normal" yara rules first, with the goal of hunting for the specific malware family. We don't have use for 0 of in them.

Then we put the yara rules in our malware extraction system (built using https://github.com/CERT-Polska/malduck, our framework used by a few other orgs). The system runs a callback for every string matched by a yara rule. Sometimes it's useful to add a "non-detecting" string that will only be used in the malware extraction (for example, a string that detects the encryption function, which our module uses to extract the encryption key). We don't want to mess with detections at this point, since the rule is already tested; we just want to have another optional string. Sometimes it's possible to do this in another way, and sometimes the 0 of hack comes in handy. The lack of a workaround is not the biggest issue, but we worry a bit about the backward compatibility of our system (and of other users of our project).

This is probably not the only use case we had, but the one that we stumbled upon first.

@plusvic
Member

plusvic commented Jul 28, 2023

Interesting, I never thought that someone would be using 0 of them in real-life rules. That's why 0 of them was made a special case to make it coherent with none of them.

I'm not sure what the best solution is here. I don't dislike the option of going back to the original meaning of 0 of them and treating none of them as a different case. I'm going to gather opinions from other YARA users and see what comes out of it.

@wxsBSD
Collaborator

wxsBSD commented Jul 28, 2023

I like the idea of marking a string as intentionally unreferenced. It seems like an elegant solution to this problem while keeping the semantics of 0 of them and none of them clear. Also, intentionally unreferenced strings open up precisely the scenario you are describing: having your callback process a string that happened to match even if you don't need it in the condition. That seems like a powerful capability to have.

@mgoffin
Contributor

mgoffin commented Jul 28, 2023

+1 for the string modifier idea!

@malvidin
Contributor

I modified my rules to use #optional_string >= 0 instead of 0 of $optional_string after the change.

@vthib
Contributor

vthib commented Jul 29, 2023

In my rules where I want to compute some strings but not have them influence the condition, I use the # >= 0 trick: <real condition> and for all of ($optional_*): (# >= 0).
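
A minimal sketch of the two count-based workarounds mentioned in the last two comments, with made-up rule and string names. Both extra clauses are always true, because a match count can never be negative, so they reference the optional strings without changing the outcome of the real condition.

```yara
rule optional_single_string
{
    strings:
        $main = "example_marker"
        $optional_string = "key="
    condition:
        // Replaces the former "0 of ($optional_string)" clause
        $main and #optional_string >= 0
}

rule optional_string_set
{
    strings:
        $main = "example_marker"
        $optional_key = "key="
        $optional_iv = "iv="
    condition:
        // Same idea applied to every string in the set
        $main and for all of ($optional_*) : ( # >= 0 )
}
```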

Another point in favor of the string modifier, and against restoring the old meaning of 0 of them, is that a modifier states intent and opens the door to optimizations. More specifically, when working out whether a rule matches (i.e. whether its condition is true), it is useful to know which of these applies:

  • it is enough to know whether a string has a match (for example, the any of them condition)
  • the number of matches and their exact offsets are needed (for example, for any i in (1..#a): (!a[i] > 5))

This matters because it is usually cheaper for a regex to report only that there is a match than to compute the exact boundaries of the match. Also, if a string already has a match and that information is enough, further checks for matches can be skipped.

The issue with the 0 of (...) syntax is that it does not indicate that full match computation is needed for those strings, so it is all or nothing: either optimize all strings and possibly miss some match callbacks, or optimize none of them and possibly lose some performance.
With a string modifier, however, the two sets of strings can be told apart, so only the strings used for matching are optimized and not the "optional" ones.
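
The two cases can be illustrated with a pair of hypothetical rules over the same regex (names and pattern are made up). In the first, knowing that $a matched at least once settles the condition. In the second, the condition inspects the length of each individual match, so every match and its boundaries have to be computed.

```yara
rule existence_is_enough
{
    strings:
        $a = /key=[0-9a-f]+/
    condition:
        // A single hit decides the result
        any of them
}

rule needs_match_details
{
    strings:
        $a = /key=[0-9a-f]+/
    condition:
        // #a is the number of matches, !a[i] the length of the i-th match
        for any i in (1..#a) : ( !a[i] > 5 )
}
```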

wxsBSD added a commit to wxsBSD/yara that referenced this issue Aug 1, 2023
As briefly discussed in VirusTotal#1937, this change will make it so that any string
identifier that starts with '_' can be unreferenced. Any anonymous strings
must still be referenced.

While testing this out I realized that an unreferenced string still had the
STRING_FLAG_FIXED_OFFSET set, which meant any unreferenced string would have a
fixed_offset of YR_UNDEFINED. To deal with this when we are reducing the rule
we unset STRING_FLAG_FIXED_OFFSET if the string is unreferenced.
@wxsBSD
Collaborator

wxsBSD commented Aug 1, 2023

I put together a PR that allows unreferenced strings if they are prefixed with $_. They are treated completely normally, except that you don't have to reference them in the condition. They will still be searched for and will be available in callbacks. I liked the idea of using $_ to signal to the compiler that a string is intentionally unreferenced instead of introducing another modifier, because modifiers are meant to indicate to the reader that the string is being modified.

This shouldn't break anyone who is already using $_ for any reason - just gives them the option to make them unreferenced in the future.

plusvic pushed a commit that referenced this issue Aug 23, 2023
As briefly discussed in #1937, this change will make it so that any string
identifier that starts with '_' can be unreferenced. Any anonymous strings
must still be referenced.

While testing this out I realized that an unreferenced string still had the
STRING_FLAG_FIXED_OFFSET set, which meant any unreferenced string would have a
fixed_offset of YR_UNDEFINED. To deal with this when we are reducing the rule
we unset STRING_FLAG_FIXED_OFFSET if the string is unreferenced.
@plusvic
Member

plusvic commented Aug 23, 2023

#1941 has been merged, so we can now have optional strings by prefixing the identifier with an underscore (e.g. $_unused).
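
For illustration, a rule using this convention might look like the sketch below, assuming a YARA build that includes the change from #1941. The rule name and string contents are hypothetical. The condition never mentions $_enc_func, yet the rule should still compile, and the string is searched for and reported to match callbacks like any other.

```yara
rule family_with_optional_string
{
    strings:
        $op_1 = { 8B 45 ?? 33 C1 }
        $op_2 = { C1 E8 05 }
        // Leading underscore marks the string as intentionally unreferenced
        $_enc_func = { 55 8B EC 83 EC ?? 53 }
    condition:
        2 of ($op_*)
}
```

Per the commit message, anonymous strings must still be referenced in the condition.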

plusvic closed this as completed Aug 23, 2023
7 participants