Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add dataset support #361

Merged
merged 1 commit into from
Aug 26, 2022
Merged

add dataset support #361

merged 1 commit into from
Aug 26, 2022

Conversation

jptosso
Copy link
Member

@jptosso jptosso commented Aug 26, 2022

WASM support is an essential feature of Coraza v3, but users cannot fully enjoy its potential because of file reading limitations. For this reason, SecDataset is a decent replacement for .data files. It's also easier to watch files for reloading on .conf files.

Compatible operators are:

  • ipMatchFromDataset: TBI
  • pmFromDataset: Implemented

Messing with the seclang (apache directives) syntax will break regression for modsecurity compatible engines.

Documentation

SecDataset

Description: Emulates .data files by inserting multiple strings, line by line. Operators such as @pmFromDataset can use these datasets as the source of dictionary matching.

Syntax: SecDataset DATASET_NAME `...`

Sample:

SecDataset sample_dataset `
word1
word2
word3
`

SecRule REQUEST_URI "@pmFromDataset sample_dataset" "...msg:'Match sample_dataset'"

Important:

  • Comment lines (#...) are ignored
  • Lines are stripped of leading and trailing whitespaces
  • Empty lines are ignored
  • Datasets are case-sensitive, but operators like pmFromDataset will use them as case-insensitive (lowercase)
  • Unicode characters are supported
  • Grave accents (`) are required at the end of the directive declaration and the line after the last word
  • If a dataset is not closed, the compiler will fail to read the following directives

pmFromDataset

Description: Performs a case-insensitive match of the provided dataset against the desired input value. The operator uses a set-based matching algorithm (Aho-Corasick), which means that it will match any number of keywords in parallel. When matching of a large number of keywords is needed, this operator performs much better than a regular expression.

This operator is the same as @pm, except that it takes a list of files as arguments. It will match any one of the phrases listed in the file(s) anywhere in the target value.

References

CC @M4tteoP

Copy link
Contributor

@piyushroshan piyushroshan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A perfect addition missing in modsecurity

Copy link
Member

@syinwu syinwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@syinwu syinwu merged commit b04e185 into v3/dev Aug 26, 2022
@syinwu syinwu deleted the seclang-datasets branch August 26, 2022 08:07
ahocorasick "github.com/petar-dambovaliev/aho-corasick"
)

// TODO according to coraza researchs, re2 matching is faster than ahocorasick
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we link the benchmarks or show results?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove this comment, it comes from v1

}

func (o *pmFromDataset) Init(options coraza.RuleOperatorOptions) error {
data := options.Arguments
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be single value always?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could accept multiple datasets if required, but I don't see much value

if tx.Capture {
matches := o.matcher.FindAll(value)
for i, match := range matches {
if i == 10 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value seems arbitrary (and kind of repetitive). Maybe we want to abstract it, at least in this package?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's part of our technical debt, we should add the constant somewhere. 10 is the documented standard though.

func directiveSecDataset(options *DirectiveOptions) error {
spl := strings.SplitN(options.Opts, " ", 2)
if len(spl) != 2 {
return errors.New("syntax error: SecDataset name `\n...\n`")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is a bit unclear to me. What does \n...\n mean?

Copy link
Member Author

@jptosso jptosso Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be printed as:

ERROR: Directive SecDataset - syntax error: SecDataset name `
...
`

It is hard to represent the syntax as a single line.

}
name := spl[0]
if _, ok := options.Datasets[name]; ok {
options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use %q, it is handy when debugging and an empty space is the cause of the problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we should normalize using %q for external inputs. We should document it as technical debt

if _, ok := options.Datasets[name]; ok {
options.Waf.Logger.Warn("Dataset %s already exists, overwriting", name)
}
arr := []string{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this good enough @anuraaga? Is it worth to work out as a best effort a capacity or since this only happens on bootstrap it is OK to use with no length.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should care much about optimizations for the seclang package and it is already way faster than modsecurity and it doesn't happen that often.

Still, we should track a seclang optimization in the future.

@@ -181,3 +181,19 @@ func TestInvalidRulesWithIgnoredErrors(t *testing.T) {
t.Error("failed to error on invalid rule")
}
}

func TestSecDataset(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to include more tests here. I guess the char "`" isn't supported in the dataset.

Copy link
Member Author

@jptosso jptosso Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"`" is supported in the dataset, the code only interprets it if it is the last character of the directive declaration or the only character in the line.

for scanner.Scan() {
p.currentLine++
line := scanner.Text()
linebuffer += strings.TrimSpace(line)
line := strings.TrimSpace(scanner.Text())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change requires a unit test.

Copy link
Member Author

@jptosso jptosso Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRS tests are passing, but we should definitely add more tests for this. It is a major change for the seclang parser.

@jcchavezs
Copy link
Member

Thanks for this PR @jptosso I wonder what happens when a dataset has duplicated values, e.g.

jptosso
piyushroshan
bxlxx
M4tteoP
bxlxx

Do we still add duplicated entries or we keep it unique? also, what happens with case insensitive dups:

jptosso
m4tteop
piyushroshan
bxlxx
M4tteoP

DFA: true,
})

// TODO this operator is supposed to support snort data syntax: "@pmFromDataset A|42|C|44|F"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this TODO is misleading, snort syntax would end up being inside the Dataset, wouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we should keep it only for PM.

Regarding duplications, it depends on the operator, the dictionary will store everything case-sensitive. The aho-corasick implementation will read it case-insensitive and might probably remove duplicates.

Still, a duplicate shouldn't affect results, just performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants