add dataset support #361

jptosso · 2022-08-26T03:43:05Z

WASM support is an essential feature of Coraza v3, but users cannot take full advantage of it because of file reading limitations. For this reason, SecDataset is a decent replacement for .data files. It is also easier to watch .conf files for reloading than to watch separate data files.

Compatible operators are:

ipMatchFromDataset: To be implemented
pmFromDataset: Implemented

Note that changing the seclang (Apache directives) syntax will break regression for ModSecurity-compatible engines.

Documentation

SecDataset

Description: Emulates .data files by inserting multiple strings, line by line. Operators such as @pmFromDataset can use these datasets as the source of dictionary matching.

Syntax: SecDataset DATASET_NAME `...`

Sample:

SecDataset sample_dataset `
word1
word2
word3
`

SecRule REQUEST_URI "@pmFromDataset sample_dataset" "...msg:'Match sample_dataset'"

Important:

Comment lines (#...) are ignored
Leading and trailing whitespace is stripped from each line
Empty lines are ignored
Datasets are stored case-sensitively, but operators such as pmFromDataset treat them as case-insensitive (lowercased)
Unicode characters are supported
A grave accent (`) is required at the end of the directive declaration line and on its own line after the last word
If a dataset is not closed, the compiler will fail to read the following directives
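
The line-handling rules above can be sketched in Go as follows. This is a minimal illustration, not the code from this PR: `parseDatasetBody` is a hypothetical helper, and it assumes the text between the grave accents has already been extracted.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDatasetBody is a hypothetical helper (not Coraza's actual code).
// It applies the documented rules to the text between the grave accents:
// whitespace is trimmed, empty lines and comment lines are skipped, and
// the original case and any Unicode characters are preserved.
func parseDatasetBody(body string) []string {
	var entries []string
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip empty lines and comments
		}
		entries = append(entries, line)
	}
	return entries
}

func main() {
	body := "\n# a comment\n  word1  \n\nWord2\npalabra-ñ\n"
	fmt.Printf("%q\n", parseDatasetBody(body))
	// prints: ["word1" "Word2" "palabra-ñ"]
}
```

Lowercasing is deliberately left out of the sketch, since the dataset itself is case-sensitive and only the operator decides how to compare.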

pmFromDataset

Description: Performs a case-insensitive match of the provided dataset against the desired input value. The operator uses a set-based matching algorithm (Aho-Corasick), which means that it will match any number of keywords in parallel. When a large number of keywords must be matched, this operator performs much better than a regular expression.

This operator is the same as @pm, except that it takes the name of a dataset as its argument instead of an inline list of phrases. It will match any one of the phrases listed in the dataset anywhere in the target value.
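
To make the matching semantics concrete, here is a small Go sketch. It is an assumption-laden illustration: `naiveMatcher` is a hypothetical type that uses a plain substring loop from the standard library, not the Aho-Corasick matcher the operator actually relies on, so it shows the expected results but not the performance characteristics.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveMatcher is a hypothetical stand-in for the Aho-Corasick matcher.
// It only illustrates the semantics: case-insensitive, match anywhere.
type naiveMatcher struct {
	words []string
}

// newNaiveMatcher lowercases every dataset entry once, at init time.
func newNaiveMatcher(dataset []string) *naiveMatcher {
	words := make([]string, 0, len(dataset))
	for _, w := range dataset {
		if w == "" {
			continue // an empty phrase would match everything
		}
		words = append(words, strings.ToLower(w))
	}
	return &naiveMatcher{words: words}
}

// Match reports whether any dataset phrase occurs anywhere in value.
func (m *naiveMatcher) Match(value string) bool {
	value = strings.ToLower(value)
	for _, w := range m.words {
		if strings.Contains(value, w) {
			return true
		}
	}
	return false
}

func main() {
	m := newNaiveMatcher([]string{"word1", "Word2", "word3"})
	fmt.Println(m.Match("/index.php?q=WORD2")) // true
	fmt.Println(m.Match("/index.php?q=other")) // false
}
```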

References

CC @M4tteoP

piyushroshan

A perfect addition missing in modsecurity

syinwu

LGTM

jcchavezs · 2022-08-26T10:14:01Z

operators/pm_from_dataset.go

+	ahocorasick "github.com/petar-dambovaliev/aho-corasick"
+)
+
+// TODO according to coraza researchs, re2 matching is faster than ahocorasick


Shall we link the benchmarks or show results?

We should remove this comment; it comes from v1.

jcchavezs · 2022-08-26T10:14:52Z

operators/pm_from_dataset.go

+}
+
+func (o *pmFromDataset) Init(options coraza.RuleOperatorOptions) error {
+	data := options.Arguments


Is this going to be single value always?

We could accept multiple datasets if required, but I don't see much value

jcchavezs · 2022-08-26T10:17:23Z

operators/pm_from_dataset.go

+	if tx.Capture {
+		matches := o.matcher.FindAll(value)
+		for i, match := range matches {
+			if i == 10 {


This value seems arbitrary (and kind of repetitive). Maybe we want to abstract it, at least in this package?

coraza/operators/pm.go, line 50 in a1529ab: if i == 10 {

coraza/operators/pm_from_file.go, line 60 in a1529ab: if i == 10 {

https://github.com/corazawaf/coraza/blob/v3/dev/operators/validate_nid.go#L50

Yes, it's part of our technical debt; we should add the constant somewhere. 10 is the documented standard, though.
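
One way the repeated literal could be abstracted is sketched below. The names `maxCaptures` and `captureMatches` are hypothetical and do not exist in the codebase; the callback stands in for whatever the transaction uses to record a captured value.

```go
package main

import "fmt"

// maxCaptures is a hypothetical package-level constant replacing the
// literal 10 that is repeated across the pm-style operators.
const maxCaptures = 10

// captureMatches is a hypothetical helper. It calls capture for at most
// maxCaptures matches and returns how many were captured.
func captureMatches(matches []string, capture func(index int, value string)) int {
	n := len(matches)
	if n > maxCaptures {
		n = maxCaptures
	}
	for i := 0; i < n; i++ {
		capture(i, matches[i])
	}
	return n
}

func main() {
	matches := []string{"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"}
	n := captureMatches(matches, func(i int, v string) {
		fmt.Printf("capture %d = %s\n", i, v)
	})
	fmt.Println("captured:", n) // captured: 10
}
```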

jcchavezs · 2022-08-26T10:22:39Z

seclang/directives.go

+func directiveSecDataset(options *DirectiveOptions) error {
+	spl := strings.SplitN(options.Opts, " ", 2)
+	if len(spl) != 2 {
+		return errors.New("syntax error: SecDataset name `\n...\n`")


The error is a bit unclear to me. What does \n...\n mean?

It will be printed as:

ERROR: Directive SecDataset - syntax error: SecDataset name ` ... `

It is hard to represent the syntax as a single line.

jcchavezs · 2022-08-26T10:23:32Z

seclang/directives.go

+	}
+	name := spl[0]
+	if _, ok := options.Datasets[name]; ok {
+		options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name)


I'd rather use %q, it is handy when debugging and an empty space is the cause of the problem.

Sure, we should normalize using %q for external inputs. We should document it as technical debt
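
A short Go illustration of the difference between the two verbs; `formatWarning` is a hypothetical helper used only for this example, and the dataset name is made up.

```go
package main

import "fmt"

// formatWarning is a hypothetical helper showing the %q variant of the
// warning message discussed above.
func formatWarning(name string) string {
	return fmt.Sprintf("Dataset %q already exists, overwriting", name)
}

func main() {
	name := "blocklist " // note the trailing space

	// With %s the stray space is invisible in the log line.
	fmt.Printf("Dataset %s already exists, overwriting\n", name)

	// With %q the value is quoted, so the space is easy to spot:
	// Dataset "blocklist " already exists, overwriting
	fmt.Println(formatWarning(name))
}
```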

jcchavezs · 2022-08-26T10:24:50Z

seclang/directives.go

+	if _, ok := options.Datasets[name]; ok {
+		options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name)
+	}
+	arr := []string{}


Is this good enough, @anuraaga? Is it worth working out a best-effort capacity, or, since this only happens at bootstrap, is it OK to allocate the slice with no length?

I don't think we should care much about optimizations for the seclang package: it is already way faster than ModSecurity, and parsing doesn't happen that often.

Still, we should track a seclang optimization for the future.

jcchavezs · 2022-08-26T10:27:18Z

seclang/directives_test.go

@@ -181,3 +181,19 @@ func TestInvalidRulesWithIgnoredErrors(t *testing.T) {
 		t.Error("failed to error on invalid rule")
 	}
 }
+
+func TestSecDataset(t *testing.T) {


We need to include more tests here. I guess the char "`" isn't supported in the dataset.

"`" is supported in the dataset, the code only interprets it if it is the last character of the directive declaration or the only character in the line.
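
The delimiter rule described in this reply can be sketched as follows. This is not the parser change from the PR: `readDataset` is a hypothetical function that handles a single directive, and it assumes that inside an open dataset only a line consisting of a lone grave accent closes it.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// readDataset is a hypothetical sketch of the delimiter rule. A grave
// accent opens the dataset only as the last character of the directive
// line, and closes it only when it is the sole character on a line.
// Anywhere else it is ordinary data.
func readDataset(input string) (head string, body []string, err error) {
	scanner := bufio.NewScanner(strings.NewReader(input))
	inBlock := false
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case !inBlock && strings.HasSuffix(line, "`"):
			head = strings.TrimSpace(strings.TrimSuffix(line, "`"))
			inBlock = true
		case inBlock && line == "`":
			return head, body, nil
		case inBlock:
			body = append(body, line)
		}
	}
	return "", nil, errors.New("dataset was not closed")
}

func main() {
	input := "SecDataset sample_dataset `\nwo`rd1\nword2`\n`\n"
	head, body, err := readDataset(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s %q\n", head, body)
	// prints: SecDataset sample_dataset ["wo`rd1" "word2`"]
}
```

Cases like these (a grave accent in the middle of an entry, at the end of an entry, and an unclosed dataset) would also be natural candidates for the additional unit tests requested above.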

jcchavezs · 2022-08-26T10:27:43Z

seclang/parser.go

 	for scanner.Scan() {
 		p.currentLine++
-		line := scanner.Text()
-		linebuffer += strings.TrimSpace(line)
+		line := strings.TrimSpace(scanner.Text())


This change requires a unit test.

CRS tests are passing, but we should definitely add more tests for this. It is a major change for the seclang parser.

jcchavezs · 2022-08-26T10:30:22Z

Thanks for this PR, @jptosso. I wonder what happens when a dataset has duplicated values, e.g.

jptosso
piyushroshan
bxlxx
M4tteoP
bxlxx

Do we still add duplicated entries, or do we keep them unique? Also, what happens with case-insensitive duplicates:

jptosso
m4tteop
piyushroshan
bxlxx
M4tteoP

M4tteoP · 2022-08-26T10:38:35Z

operators/pm_from_dataset.go

+		DFA:                  true,
+	})
+
+	// TODO this operator is supposed to support snort data syntax: "@pmFromDataset A|42|C|44|F"


nit: I think this TODO is misleading, snort syntax would end up being inside the Dataset, wouldn't it?

Yep, we should keep it only for PM.

Regarding duplicates, it depends on the operator: the dictionary will store everything case-sensitively. The aho-corasick implementation will read it case-insensitively and will probably remove duplicates.

Still, a duplicate shouldn't affect results, just performance.
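
If explicit deduplication were ever wanted before building the matcher, a minimal Go sketch could look like this. `dedupeFold` is a hypothetical helper, not part of this PR, and whether the aho-corasick library already does something equivalent is not confirmed here.

```go
package main

import (
	"fmt"
	"strings"
)

// dedupeFold is a hypothetical helper that removes case-insensitive
// duplicates, keeping the first spelling seen for each entry.
func dedupeFold(words []string) []string {
	seen := make(map[string]struct{}, len(words))
	out := make([]string, 0, len(words))
	for _, w := range words {
		key := strings.ToLower(w)
		if _, ok := seen[key]; ok {
			continue // already present under a different or identical case
		}
		seen[key] = struct{}{}
		out = append(out, w)
	}
	return out
}

func main() {
	words := []string{"jptosso", "m4tteop", "piyushroshan", "bxlxx", "M4tteoP"}
	fmt.Printf("%q\n", dedupeFold(words))
	// prints: ["jptosso" "m4tteop" "piyushroshan" "bxlxx"]
}
```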

add dataset support

c7ec811

jptosso requested review from syinwu, fzipi, jcchavezs and piyushroshan August 26, 2022 03:43

piyushroshan approved these changes Aug 26, 2022

syinwu approved these changes Aug 26, 2022

syinwu merged commit b04e185 into v3/dev Aug 26, 2022

syinwu deleted the seclang-datasets branch August 26, 2022 08:07

jcchavezs reviewed Aug 26, 2022

M4tteoP reviewed Aug 26, 2022

M4tteoP mentioned this pull request Aug 26, 2022

Implements ipMatchFromDataset, parsing for ipMatchFromFile #363

Merged

3 tasks

jptosso mentioned this pull request Aug 31, 2022

Feature: Pm custom separator #349

Closed
