-
-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add dataset support #361
add dataset support #361
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A perfect addition missing in modsecurity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
ahocorasick "github.com/petar-dambovaliev/aho-corasick" | ||
) | ||
|
||
// TODO according to coraza researchs, re2 matching is faster than ahocorasick |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we link the benchmarks or show results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should remove this comment, it comes from v1
} | ||
|
||
func (o *pmFromDataset) Init(options coraza.RuleOperatorOptions) error { | ||
data := options.Arguments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be single value always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could accept multiple datasets if required, but I don't see much value
if tx.Capture { | ||
matches := o.matcher.FindAll(value) | ||
for i, match := range matches { | ||
if i == 10 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value seems arbitrary (and kind of repetitive). Maybe we want to abstract it, at least in this package?
Line 50 in a1529ab
if i == 10 { coraza/operators/pm_from_file.go
Line 60 in a1529ab
if i == 10 { - https://github.com/corazawaf/coraza/blob/v3/dev/operators/validate_nid.go#L50
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's part of our technical debt, we should add the constant somewhere. 10 is the documented standard though.
func directiveSecDataset(options *DirectiveOptions) error { | ||
spl := strings.SplitN(options.Opts, " ", 2) | ||
if len(spl) != 2 { | ||
return errors.New("syntax error: SecDataset name `\n...\n`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error is a bit unclear to me. What does \n...\n
mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be printed as:
ERROR: Directive SecDataset - syntax error: SecDataset name `
...
`
It is hard to represent the syntax as a single line.
} | ||
name := spl[0] | ||
if _, ok := options.Datasets[name]; ok { | ||
options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather use %q
, it is handy when debugging and an empty space is the cause of the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we should normalize using %q for external inputs. We should document it as technical debt
if _, ok := options.Datasets[name]; ok { | ||
options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name) | ||
} | ||
arr := []string{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this good enough @anuraaga? Is it worth to work out as a best effort a capacity or since this only happens on bootstrap it is OK to use with no length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should care much about optimizations for the seclang package and it is already way faster than modsecurity and it doesn't happen that often.
Still, we should track a seclang optimization in the future.
@@ -181,3 +181,19 @@ func TestInvalidRulesWithIgnoredErrors(t *testing.T) { | |||
t.Error("failed to error on invalid rule") | |||
} | |||
} | |||
|
|||
func TestSecDataset(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to include more tests here. I guess the char "`" isn't supported in the dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"`" is supported in the dataset, the code only interprets it if it is the last character of the directive declaration or the only character in the line.
for scanner.Scan() { | ||
p.currentLine++ | ||
line := scanner.Text() | ||
linebuffer += strings.TrimSpace(line) | ||
line := strings.TrimSpace(scanner.Text()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change requires a unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRS tests are passing, but we should definitely add more tests for this. It is a major change for the seclang parser.
Thanks for this PR @jptosso I wonder what happens when a dataset has duplicated values, e.g.
Do we still add duplicated entries or we keep it unique? also, what happens with case insensitive dups:
|
DFA: true, | ||
}) | ||
|
||
// TODO this operator is supposed to support snort data syntax: "@pmFromDataset A|42|C|44|F" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this TODO is misleading, snort syntax would end up being inside the Dataset, wouldn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we should keep it only for PM.
Regarding duplications, it depends on the operator, the dictionary will store everything case-sensitive. The aho-corasick implementation will read it case-insensitive and might probably remove duplicates.
Still, a duplicate shouldn't affect results, just performance.
WASM support is an essential feature of Coraza v3, but users cannot fully enjoy its potential because of file reading limitations. For this reason,
SecDataset
is a decent replacement for .data files. It's also easier to watch files for reloading on .conf files.Compatible operators are:
Messing with the seclang (apache directives) syntax will break regression for modsecurity compatible engines.
Documentation
SecDataset
Description: Emulates .data files by inserting multiple strings, line by line. Operators such as @pmFromDataset can use these datasets as the source of dictionary matching.
Syntax: SecDataset DATASET_NAME `...`
Sample:
Important:
pmFromDataset
Description: Performs a case-insensitive match of the provided dataset against the desired input value. The operator uses a set-based matching algorithm (Aho-Corasick), which means that it will match any number of keywords in parallel. When matching of a large number of keywords is needed, this operator performs much better than a regular expression.
This operator is the same as @pm, except that it takes a list of files as arguments. It will match any one of the phrases listed in the file(s) anywhere in the target value.
References
CC @M4tteoP