-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement rule sets. #1597
Implement rule sets. #1597
Conversation
This adds the ability to use rule sets just like we do with string sets. Specifically, it enables conditions like the following: 5 of (a, b) 50% of (a*) where "a" is an existing rule and "b*" is a wildcard rule identifier, which is expanded out precisely as wildcard string identifiers are. I chose to make OP_OF and OP_OF_PERCENT emit an argument which is the "type" of set it is expected to match on. The two acceptable values are OF_STRING_SET and OF_RULE_SET. When OF_STRING_SET is used then the references on the stack are expected to be strings and OF_RULE_SET will check if the rules on the stack match. I added a yr_parser_emit_pushes_for_rules() which behaves similarly to the string counterpart. The one big exception is that I can't use yr_rules_foreach macro since that needs the terminating "null" rule which is not yet added, so instead I have to iterate the rules manually using yr_arena_get_ptr() to walk all the currently compiled rules.
a simple way to use this feature without this pr is using condition:
# at lease two rules
math.to_number(rule1) + math.to_number(rule2) + math.to_number(rule3) >= 2 condition:
# add weight / score to rule
math.to_number(rule1) * 20 + math.to_number(rule2) * 30 + math.to_number(rule3) * 30 >= 60 the problem is all rules must be matched at first, yara can not optimize this expression to stop earlier (for example, the first demo, if rule1 and rule2 can be matched, it's not necessary to run rule3) |
You are correct that this would work but there are two reasons this PR is a good idea. First, "2 of (rule1, rule2, rule3)" is a much more concise expression. Second, this supports wildcards so you can say "2 of (foo*)" and never have to update your condition if you add another rule. Your point about weights is valid, and if you want to weight one rule more than another you need to use your approach. |
This is really nice, but I have a concern that I'm not sure how to solve. When you have a condition like You are currently handling the case in which a rule with condition I guess the only solution for that is making the order in which rules are defined irrelevant, but that probably means more important changes, and introduces new situations that must be detected, like cyclical dependencies among rules. I'm not sure how to tackle this. |
That's a really good point I had not thought of. Let me think about this a bit and maybe I can come up with a solution. |
You're right that doing anything at runtime is going to open up new situations to deal with. Instead, I think this can be turned into a compiler error. The rough plan I am going to try is:
This would recognize that you're using a rule name that is covered by an identifier with a wildcard AFTER the rule with the wildcarded identifier is defined. I could turn this into an error so that the rules will not compile unless the order is correct. |
That sounds like a good compromise. |
Make sure that when we compile a new rule that any existing rule sets do not match the new rule. Specifically we are making the following a compiler error: rule a1 { condition: true } rule b { condition: 1 of (a*) } rule a2 { condition: true } This is a compiler error because "a2" is defined after the usage of (a*) in rule b. There is still an edge case here. These rules compile: rule a { condition 1 of (a) } rule b { condition: 1 of (b*) } In both cases those rules evaluate to false, which is in line with the behavior of this rule: rule c { condition: c } Given that the behavior around rule sets matches the current behavior I think this is acceptable.
I implemented the strict checking in the compiler. I ended up using a hash table to allow me to quickly check if a wildcard identifier has been processed already (so I can avoid adding it to the list). I'm not sure this is worth the tradeoff in memory usage during compilation, but I figure it was worth it so I can do quick lookups as I can see rule sets being used fairly often in. |
What if instead of having both a linked list and a hash table, we implement a mechanism for iterating all the items in a hash table? You are using the linked list for storing all the wildcard identifiers, while the hash table is used only for avoiding duplicate entries in the linked list, but the hash table already contains all you need, it's just that we don't have a way of iterating over all the items in the hash table. In this case the hash table is just a collection of linked list of |
I thought about doing that but decided not to since I wasn't sure it would be worth it for this one use case. I'll add it to this PR soon (currently working on addressing concerns in my outstanding gyp PR). |
This adds the ability to use rule sets just like we do with string sets. Specifically, it enables conditions like the following: 5 of (a, b) 50% of (a*) where "a" is an existing rule and "b*" is a wildcard rule identifier, which is expanded out precisely as wildcard string identifiers are. I chose to make OP_OF and OP_OF_PERCENT emit an argument which is the "type" of set it is expected to match on. The two acceptable values are OF_STRING_SET and OF_RULE_SET. When OF_STRING_SET is used then the references on the stack are expected to be strings and OF_RULE_SET will check if the rules on the stack match. I added a yr_parser_emit_pushes_for_rules() which behaves similarly to the string counterpart. The one big exception is that I can't use yr_rules_foreach macro since that needs the terminating "null" rule which is not yet added, so instead I have to iterate the rules manually using yr_arena_get_ptr() to walk all the currently compiled rules. * Stricter ordering of rule set in compiler. Make sure that when we compile a new rule that any existing rule sets do not match the new rule. Specifically we are making the following a compiler error: rule a1 { condition: true } rule b { condition: 1 of (a*) } rule a2 { condition: true } This is a compiler error because "a2" is defined after the usage of (a*) in rule b. There is still an edge case here. These rules compile: rule a { condition 1 of (a) } rule b { condition: 1 of (b*) } In both cases those rules evaluate to false, which is in line with the behavior of this rule: rule c { condition: c } Given that the behavior around rule sets matches the current behavior I think this is acceptable. * Add docs for rule sets. * Implement hash table iterators. * Remove old comment about linked list. * Move variable declaration. * Iterate over rules better.
This adds the ability to use rule sets just like we do with string sets. Specifically, it enables conditions like the following: 5 of (a, b) 50% of (a*) where "a" is an existing rule and "b*" is a wildcard rule identifier, which is expanded out precisely as wildcard string identifiers are. I chose to make OP_OF and OP_OF_PERCENT emit an argument which is the "type" of set it is expected to match on. The two acceptable values are OF_STRING_SET and OF_RULE_SET. When OF_STRING_SET is used then the references on the stack are expected to be strings and OF_RULE_SET will check if the rules on the stack match. I added a yr_parser_emit_pushes_for_rules() which behaves similarly to the string counterpart. The one big exception is that I can't use yr_rules_foreach macro since that needs the terminating "null" rule which is not yet added, so instead I have to iterate the rules manually using yr_arena_get_ptr() to walk all the currently compiled rules. * Stricter ordering of rule set in compiler. Make sure that when we compile a new rule that any existing rule sets do not match the new rule. Specifically we are making the following a compiler error: rule a1 { condition: true } rule b { condition: 1 of (a*) } rule a2 { condition: true } This is a compiler error because "a2" is defined after the usage of (a*) in rule b. There is still an edge case here. These rules compile: rule a { condition 1 of (a) } rule b { condition: 1 of (b*) } In both cases those rules evaluate to false, which is in line with the behavior of this rule: rule c { condition: c } Given that the behavior around rule sets matches the current behavior I think this is acceptable. * Add docs for rule sets. * Implement hash table iterators. * Remove old comment about linked list. * Move variable declaration. * Iterate over rules better.
This adds the ability to use rule sets just like we do with string sets.
Specifically, it enables conditions like the following:
5 of (a, b)
50% of (a*)
where "a" is an existing rule and "b*" is a wildcard rule identifier, which is
expanded out precisely as wildcard string identifiers are.
I chose to make OP_OF and OP_OF_PERCENT emit an argument which is the "type" of
set it is expected to match on. The two acceptable values are OF_STRING_SET and
OF_RULE_SET. When OF_STRING_SET is used then the references on the stack are
expected to be strings and OF_RULE_SET will check if the rules on the stack
match.
I added a yr_parser_emit_pushes_for_rules() which behaves similarly to the
string counterpart. The one big exception is that I can't use yr_rules_foreach
macro since that needs the terminating "null" rule which is not yet added, so
instead I have to iterate the rules manually using yr_arena_get_ptr() to walk
all the currently compiled rules.