Add regex/parse command to PPL #359

Closed
joshuali925 opened this issue Jan 4, 2022 · 0 comments
Labels
enhancement New feature or request v1.3.0

joshuali925 commented Jan 4, 2022

Overview

Currently, PPL can query fields defined at index time (schema-on-write), and it also supports defining fields at query time (schema-on-read) through eval. It would also be helpful to extract fields from a text field using a regular expression. For example, instead of extracting fields from the raw log message at index time, users may choose to index the whole raw log message as a single text field. At query time, they can then use a regex on the fly to extract any field they want and process it further.

Requirements

  • Implement parse command
E | parse <text-field> <regex-expression>
  regex-expression: regex string with named capture groups '(?<namedGroup>regex)'
  • Grammar

    • E: the input event stream.
    • expression (the <text-field> argument): An expression that evaluates to a string; this is usually the raw text field.
    • namedGroup: The name of the field that receives the value extracted from the string expression.

Example

# raw_field
223.87.60.27 - - [2018-07-22T00:39:02.912Z] "GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1" 200 6219 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1"

# PPL
source = index | parse raw_field '.*?"-" "(?<userAgent>[^"]+)"'
  • Functionality: Fields extracted by the parse command at search time can be used in the same way as fields defined at index time.
  • Security: Access to the string expression should follow the FGAC field access policy. Fields extracted at search time do not follow the FGAC field access policy, because a client that is authorized to access the string expression should also be authorized to access the fields extracted from it.
  • Performance: The parse command should have performance comparable to a runtime search request.
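
To make the intended semantics concrete, here is a minimal sketch of what the parse example above does to a single row, written directly against java.util.regex. The class name ParseSketch and its methods are hypothetical, not the plugin's actual code. Using find() rather than a full-string match, and returning no fields when the pattern does not match, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the plugin's implementation.
class ParseSketch {
    // Naive scan for "(?<name>" in the pattern source. It skips lookbehinds
    // such as "(?<=" because '=' and '!' are not valid name characters, but
    // it does not handle escaped parentheses.
    private static final Pattern GROUP_NAME =
        Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");

    static List<String> namedGroups(String regex) {
        List<String> names = new ArrayList<>();
        Matcher m = GROUP_NAME.matcher(regex);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    // Returns one entry per named group that took part in the match.
    // Assumption: a row that does not match yields no extracted fields.
    static Map<String, String> parse(String text, String regex) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.find()) {
            return fields;
        }
        for (String name : namedGroups(regex)) {
            String value = m.group(name);
            if (value != null) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String raw = "223.87.60.27 - - [2018-07-22T00:39:02.912Z] "
            + "\"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" "
            + "\"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\"";
        System.out.println(parse(raw, ".*?\"-\" \"(?<userAgent>[^\"]+)\""));
    }
}
```

Each named group becomes a key in the returned map, which mirrors how namedGroup is described in the grammar: the group name is the new field and the captured text is its value.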

Field type conversion

Derived fields are strings by default. They can be cast to other types using the cast function.
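
The following sketch shows why the cast matters, using the log line from the example above. The field names status and bytes, the pattern, and the class CastSketch are all hypothetical and exist only for illustration. Integer.parseInt stands in for the PPL cast here and is not the PPL syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the plugin's implementation.
class CastSketch {
    // Assumed pattern: the status code and byte count that follow the
    // quoted request line in the sample log format.
    private static final Pattern STATUS_AND_BYTES =
        Pattern.compile("\" (?<status>\\d{3}) (?<bytes>\\d+) \"");

    // Extracted values are always strings, like derived fields in PPL.
    static String extract(String raw, String field) {
        Matcher m = STATUS_AND_BYTES.matcher(raw);
        return m.find() ? m.group(field) : null;
    }

    // Lexicographic comparison: what happens if the value stays a string.
    static boolean largerAsString(String a, String b) {
        return a.compareTo(b) > 0;
    }

    // Numeric comparison: what happens after an explicit conversion.
    static boolean largerAsInt(String a, String b) {
        return Integer.parseInt(a) > Integer.parseInt(b);
    }

    public static void main(String[] args) {
        String raw = "223.87.60.27 - - [2018-07-22T00:39:02.912Z] "
            + "\"GET /opensearch/opensearch-1.0.0.deb_1 HTTP/1.1\" 200 6219 \"-\" "
            + "\"Mozilla/5.0 (X11; Linux x86_64; rv:6.0a1) Gecko/20110421 Firefox/6.0a1\"";
        String bytes = extract(raw, "bytes");
        System.out.println(largerAsString(bytes, "700"));
        System.out.println(largerAsInt(bytes, "700"));
    }
}
```

Without a conversion, filters and sorts on an extracted numeric field compare character by character, so a byte count of 6219 sorts before 700.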

Performance

Currently, regex evaluation happens on the coordinating node. In the first phase, we will use circuit breakers to limit query execution time; in the future, we will work to support distributed queries.
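
One way to bound the time a single java.util.regex match may take is to wrap the input in a CharSequence that checks a deadline on every character access. The sketch below illustrates that general technique only. It is not the OpenSearch circuit breaker mechanism, and every name in it (TimeLimitedRegex, DeadlineCharSequence, RegexTimeoutException) is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the plugin's implementation.
class TimeLimitedRegex {
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    // Delegates to the wrapped text, but aborts once the deadline has passed.
    // The regex engine reads input through charAt, so a runaway match is
    // interrupted at its next character access.
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex evaluation exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    // Returns the named group, or null if the pattern does not match.
    // Throws RegexTimeoutException if matching runs past the budget.
    static String extract(String text, Pattern pattern, String group, long timeoutNanos) {
        long deadline = System.nanoTime() + timeoutNanos;
        Matcher m = pattern.matcher(new DeadlineCharSequence(text, deadline));
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\"(?<ua>[^\"]+)\"$");
        System.out.println(extract("\"-\" \"Firefox/6.0a1\"", p, "ua", 1_000_000_000L));
    }
}
```

This guards one match at a time. A query-level limit would still need to account for the number of rows evaluated on the coordinating node.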

Additionally, the parse command uses java.util.regex. We can evaluate whether switching to an optimized third-party library such as re2j helps. re2j runs in linear time and is better at handling untrusted regular expressions. It is not Perl-regex compatible and doesn't support features such as lookarounds, but those might not be as important for PPL regex.

Limitations of the current implementation

See https://github.com/opensearch-project/sql/blob/578c6b27bf3eb6707f3c5927783bd4f6a0de71c0/docs/user/ppl/cmd/parse.rst#limitation

Demo (type conversion syntax has changed)

Screen.Recording.2022-01-07.at.11.31.56.AM.mov